About the project

This course, “IODS - Introduction to Open Data Science”, is organized by the Doctoral School of Social Sciences and the Centre for Research Methods, University of Helsinki. The themes of the course are open data, open science and data science. The aim of the course is to understand the concepts of open data and open science and to learn to use some of the open tools available (= data science). Students learn to analyze and visualize data with sophisticated statistical methods using these open-access tools and become familiar with the concept of reproducible science.

Link to my GitHub repository: https://github.com/anninahaverinen/IODS-project


Regression and model validation

Annina Haverinen, 7.11.2018, IODS-project, Exercise 2

Getting started: reading the data “learning2014” from local disk; the original data has already been wrangled.

The learning2014 data is a subset of data from a study investigating the relationship between learning approaches and student achievement. It consists of 7 variables: gender (F/M), age, exam points, global attitude towards statistics, and questions on surface learning, deep learning and strategic learning. There are 166 observations in this dataset, representing 166 students.

learning2014<-read.table("/Users/Annina/Documents/GitHub/IODS-project/Data/learning2014.txt")

Looking at the dimensions, structure and first observations of the data.

dim(learning2014) # gives the dimensions of the dataset: 166 observations of 7 variables.
## [1] 166   7
str(learning2014) # gives the names of the variables, the type of each variable (factor with number of levels, numeric or integer) and the first observations
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ surf    : num  2.64 3.09 2.18 2.27 2.82 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

Next a summary of the variables:

summary(learning2014)
##  gender       Age            Points         Attitude          surf      
##  F:110   Min.   :17.00   Min.   : 7.00   Min.   :14.00   Min.   :1.545  
##  M: 56   1st Qu.:21.00   1st Qu.:19.00   1st Qu.:26.00   1st Qu.:2.455  
##          Median :22.00   Median :23.00   Median :32.00   Median :2.818  
##          Mean   :25.51   Mean   :22.72   Mean   :31.43   Mean   :2.783  
##          3rd Qu.:27.00   3rd Qu.:27.75   3rd Qu.:37.00   3rd Qu.:3.091  
##          Max.   :55.00   Max.   :33.00   Max.   :50.00   Max.   :4.273  
##       deep            stra      
##  Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.333   1st Qu.:2.625  
##  Median :3.667   Median :3.188  
##  Mean   :3.680   Mean   :3.121  
##  3rd Qu.:4.083   3rd Qu.:3.625  
##  Max.   :4.917   Max.   :5.000

The summary gives descriptive statistics for the variables. For factor variables it shows the number of observations per level, and for numeric variables the minimum and maximum, the 1st and 3rd quartiles, the mean and the median. This gives an overview of the distribution of each variable. The summary suggests that “Age” is skewed while the other variables are fairly normally distributed (in a normal distribution the mean is close to the median, the quartiles lie symmetrically around the mean, and the minimum and maximum are roughly symmetric).
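As a quick illustration of the mean-vs-median check (a sketch on synthetic data, not on learning2014): a right-skewed variable pulls its mean above its median, while a symmetric one keeps them close.

```r
# Illustrative sketch on synthetic data: right skew pulls the mean above the median.
set.seed(1)
skewed    <- rexp(1000, rate = 0.1)  # right-skewed, like "Age" here
symmetric <- rnorm(1000, mean = 25)  # roughly normal
mean(skewed) - median(skewed)        # clearly positive under right skew
mean(symmetric) - median(symmetric)  # close to zero for a symmetric variable
```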

To get an idea of the possible relationships between the variables in the data we plot them in a plot matrix.

library(ggplot2) #installed for graphics
library(GGally) #installed for graphics
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20))) # plot matrix of the data

The plot matrix shows scatter plots of all pairwise combinations of the variables in the data frame, together with correlation coefficients and density curves for estimating the distributions. Visually, the distributions look more or less normal for all variables except “Age”. Below, a correlation matrix.

p2 <- ggcorr(learning2014[-1]) # plots a correlation matrix; helps visualize the correlated variables in the data set
p2

From these plots we can see that the “Attitude”-“Points” pair has the highest correlation coefficient. “Attitude” and “stra” are positively correlated with “Points”, and “deep” is slightly negatively correlated with “Points”.

In this multiple regression model (named LM1) I therefore include “Attitude”, “stra” and “deep” (3 variables are required in the exercise) as explanatory variables for “Points”. In the next model (LM2) I include only “Attitude” and “stra”, these two being positively correlated with “Points”. The last model (LM3) is a simple linear regression model with only one explanatory variable, “Attitude”.

LM1<-lm(Points~Attitude+stra+deep, data=learning2014) 
LM1
## 
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = learning2014)
## 
## Coefficients:
## (Intercept)     Attitude         stra         deep  
##     11.3915       0.3525       0.9621      -0.7492
summary(LM1) #metrics indicating the quality of the model fit 
## 
## Call:
## lm(formula = Points ~ Attitude + stra + deep, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.39145    3.40775   3.343  0.00103 ** 
## Attitude     0.35254    0.05683   6.203 4.44e-09 ***
## stra         0.96208    0.53668   1.793  0.07489 .  
## deep        -0.74920    0.75066  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08
confint(LM1) # confidence intervals of the coefficient estimates; from these we can also tell whether the coefficient estimates are significant in our model
##                   2.5 %     97.5 %
## (Intercept)  4.66211614 18.1207860
## Attitude     0.24030902  0.4647614
## stra        -0.09769802  2.0218634
## deep        -2.23153872  0.7331395

Interpretation of the multiple linear regression model LM1: “Attitude”, “stra” and “deep” as explanatory variables for “Points”.

Residuals (= how far the observations fall from the regression line) show how well the data fit the model. The residual median should be close to 0 (the mean of the residuals is 0 by construction), and the residuals should be normally distributed.
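The residual checks described above can be sketched on any fitted lm object; here on the built-in mtcars data (an illustrative fit, since learning2014 is read from a local disk):

```r
# Sketch of the residual checks, on a toy fit (mtcars):
fit <- lm(mpg ~ wt, data = mtcars)
summary(residuals(fit)) # the median should sit near 0
mean(residuals(fit))    # the mean of OLS residuals is 0 by construction
```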

The coefficient estimates are the slope values calculated by the regression. Significance levels for the coefficient estimates are indicated with symbols: the estimate is significant for “Attitude” (*** = p < 0.001) and weakly significant for “stra” (. = p < 0.1). The coefficient estimate for “deep” is not significant in the model. High significance means that it is unlikely that “Attitude” is uncorrelated with “Points”.

The standard error of a coefficient estimate describes the variability of the estimate; it should be small relative to the estimate itself.

The t value and its p-value indicate that the “Attitude” coefficient estimate contributes significantly to the model; the others are non-significant.

The residual standard error is a measure of the variability of the residuals. As the standard deviation of the residuals, it tells how well the data fit the model and should be roughly in proportion to the residual quartiles (for normally distributed residuals the quartiles lie at about 0.67 standard deviations from zero, i.e. the standard error is roughly 1.5 times the quartiles). In this case the residual standard error is well in proportion to the quartiles.

Evaluation of the goodness of fit of the model: the adjusted R-squared is 19.5% (adjusted because of multiple regression; the unadjusted value can be used in simple regression). This means that 19.5% of the variation in “Points” can be explained by this model with the variables “Attitude”, “stra” and “deep”. A high R-squared indicates a strong relationship, a low one a weak relationship; it tells how much of the phenomenon can be explained by the model, i.e. the strength of the relationship between the model and the response.
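What R-squared measures can be verified by hand as 1 - SS_res/SS_tot, the share of the response variance the model explains; a sketch on mtcars (the course data file is local):

```r
# R-squared by hand: 1 - residual sum of squares / total sum of squares
fit    <- lm(mpg ~ wt, data = mtcars)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - ss_res / ss_tot # matches summary(fit)$r.squared
```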

F test of the overall significance of the model: it determines whether the R-squared relationship is statistically significant. Its p-value should be compared with the p-values of the coefficient estimates; in this model the F p-value is larger than the t-test p-value for “Attitude”.

Based on this output I would exclude “deep” from the multiple regression model and simplify the model to only two explanatory variables.

LM2: “Attitude” and “stra” as explanatory variables for “Points”

LM2<-lm(Points~Attitude+stra, data=learning2014)
LM2
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = learning2014)
## 
## Coefficients:
## (Intercept)     Attitude         stra  
##      8.9729       0.3466       0.9137
summary(LM2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
confint(LM2) # gives confidence intervals of the coefficient estimates
##                  2.5 %     97.5 %
## (Intercept)  4.2418686 13.7039310
## Attitude     0.2349813  0.4581806
## stra        -0.1417267  1.9690301

LM2: The adjusted R-squared is the same, 19.5%, but the F statistic is higher (21.0) with a smaller p-value, indicating a better overall fit than LM1.
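Nested models such as LM1 and LM2 can also be compared formally with an F test via anova(); a sketch on mtcars with illustrative variables, since the course data is read from local disk:

```r
# Comparing nested linear models with an F test (illustrative variables)
reduced <- lm(mpg ~ wt + hp, data = mtcars)
full    <- lm(mpg ~ wt + hp + drat, data = mtcars)
anova(reduced, full) # a small p-value would favour keeping the extra variable
```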

LM3: Next a simple regression model with only one explanatory variable “Attitude” for “Points”

LM3<-lm(Points~Attitude,data=learning2014)
LM3
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Coefficients:
## (Intercept)     Attitude  
##     11.6372       0.3525
summary(LM3)
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
confint(LM3) # gives confidence intervals of the coefficient estimates
##                 2.5 %     97.5 %
## (Intercept) 8.0230738 15.2512331
## Attitude    0.2405135  0.4645785

LM3: The adjusted R-squared is essentially unaffected (18.6%), but the F statistic is 38.61, with a p-value of the same magnitude as that of the “Attitude” coefficient estimate (as expected in simple regression, where they test the same thing). I would conclude that this simple model is better than those with multiple explanatory variables.

Below, scatterplots with the regression line (the 95% CI shown in light grey around the line) for the LM2 and LM3 models.

qplot(Attitude + stra, Points, data = learning2014) + geom_smooth(method = "lm") # note: the x-axis here is the sum of the two explanatory variables

qplot(Attitude, Points, data = learning2014) + geom_smooth(method = "lm")

Validation of the model: below, the regression models LM2 and LM3 are subjected to diagnostic plots.

plot(LM2, which = c(1,2,5))

plot(LM3, which= c(1,2,5))

The diagnostic plots are used to see how well our data fit the model.

  1. Residuals vs fitted values: shows the variance of the residuals. In this case the fit is quite good; we see no pattern in the scatter plot.
  2. Normal Q-Q plot: shows whether the residuals of our model are normally distributed, which is an assumption of the model. In this case some deviation from normality can be seen at the extremes, but I interpret the fit as good enough.
  3. Residuals vs leverage: helps us find influential observations that could have an impact on the regression line (not all outliers are influential, i.e. determine the regression line). Here no cases fall outside Cook's distance, i.e. there are no influential extremes in the data.

The interpretation is that the data fit the regression models (both the multiple (2-variable) and the simple regression model) and that the models are validated for the data.

Logistic regression and cross-validation

This exercise is based on a data set about student achievement in secondary education in two Portuguese schools. The data include student grades and demographic, social and school-related features, collected using school reports and questionnaires. The data “alc” is a subset of this data, with alcohol use on weekdays and weekends combined by averaging and high use defined as an average > 2 (scale: 1 = very low to 5 = very high). Links to original dataset and publication.

Exploring the data

Reading the wrangled data from local disk.

alc<-read.table("/Users/Annina/Documents/GitHub/IODS-project/Data/alc.txt",sep=";") 

Next, let's explore the data:

dim(alc) 
## [1] 382  35
colnames(alc) 
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data set is composed of 35 variables and 382 observations (i.e. 382 students).

summary(alc$high_use) # = alc_use >2 = TRUE 
##    Mode   FALSE    TRUE 
## logical     268     114

The response variable of interest is the binary variable “high_use” (TRUE/FALSE), corresponding to alcohol use > 2 on a scale from 1 to 5.

My hypothesis is that high alcohol use is associated with the variables “goout”, “sex” and “absences”, and that these could be used in a model to predict high use. Specifically: male students drink more than female students, students who go out more drink more alcohol, and students with more absences from school drink more alcohol.

Lets look at the chosen variables:

p1 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() +
  ylab("Going out") + xlab("Alcohol high use") +
  ggtitle("Student going out by alcohol use and sex")
p1 

In this boxplot we can see that male students go out more than female students and that going out more seems positively related to high alcohol use.

p2 <- ggplot(alc, aes(high_use, absences, col=sex))+ylab("Absences")+xlab("Alcohol high use")+geom_boxplot()+ggtitle("Alcohol use by student absences and sex")

p2 #numeric 0-93 days

In this boxplot it looks like the students with high alcohol use have somewhat more absences.

table(alc$high_use,alc$sex)
##        
##           F   M
##   FALSE 156 112
##   TRUE   42  72

From this table we see that 39% of the male students and 21% of the female students fall in the high alcohol use group. Of the whole data, 30% are in the high use group.
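The percentages quoted above come straight from the cross-table; a sketch of the arithmetic, with the counts copied from the table:

```r
# Shares of high users by sex, from the 2x2 table above
tab <- matrix(c(156, 42, 112, 72), nrow = 2,
              dimnames = list(high_use = c(FALSE, TRUE), sex = c("F", "M")))
tab["TRUE", "M"] / sum(tab[, "M"]) # ~0.39 of male students are high users
tab["TRUE", "F"] / sum(tab[, "F"]) # ~0.21 of female students
sum(tab["TRUE", ]) / sum(tab)      # ~0.30 of all students
```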

p3 <- ggplot(alc, aes(high_use, col = sex)) + geom_bar() +
  ylab("Students") + xlab("Alcohol high use") + ggtitle("Alcohol high use by sex")
p3

In this barplot we see that more men than women are in the high use group.

Logistic regression model

model1 <- glm(high_use ~ absences + sex + goout, data = alc, family = "binomial")
summary(model1) 
## 
## Call:
## glm(formula = high_use ~ absences + sex + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7871  -0.8153  -0.5446   0.8357   2.4740  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.16317    0.47506  -8.764  < 2e-16 ***
## absences     0.08418    0.02237   3.764 0.000167 ***
## sexM         0.95872    0.25459   3.766 0.000166 ***
## goout        0.72981    0.11970   6.097 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.75  on 378  degrees of freedom
## AIC: 395.75
## 
## Number of Fisher Scoring iterations: 4
coef(model1) 
## (Intercept)    absences        sexM       goout 
## -4.16317087  0.08418399  0.95871896  0.72980859
library(dplyr) # for the pipe operator %>% and mutate/select below
ORmodel1<-coef(model1)%>%exp 
CImodel1<-confint(model1)%>%exp
## Waiting for profiling to be done...
ORCImodel1<-cbind(ORmodel1,CImodel1) 
ORCImodel1 
##               ORmodel1       2.5 %     97.5 %
## (Intercept) 0.01555815 0.005885392 0.03804621
## absences    1.08782902 1.042458467 1.13933894
## sexM        2.60835292 1.593132148 4.33151387
## goout       2.07468346 1.650182481 2.64111050

The residuals are more or less normally distributed and the median is close to 0. The coefficients of the variables all show statistical significance in the model, meaning these variables are significant predictors of high alcohol use.

Male sex is associated with 2.6 times the odds of high alcohol use compared with female sex. Each one-unit increase in “goout” multiplies the odds of high use by 2.07, and each additional absence multiplies them by 1.09.
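The odds ratios are simply the exponentiated log-odds coefficients; with the estimates printed in the model output above:

```r
# Log-odds coefficients (from the model1 output above) to odds ratios
exp(0.95872) # sexM: ~2.61 times the odds of high use for male students
exp(0.72981) # goout: odds multiply by ~2.07 per extra unit of going out
exp(0.08418) # absences: odds multiply by ~1.09 per additional absence
```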

Predictions

probabilities <- predict(model1, type="response")
alc<-mutate(alc,probability=probabilities)
alc<-mutate(alc, prediction =probability>0.5)
dim(alc)
## [1] 382  37
select(alc, sex,goout,absences,probability,prediction)%>%tail(10) 
##     sex goout absences probability prediction
## 373   M     2        0  0.14869987      FALSE
## 374   M     3        7  0.39514446      FALSE
## 375   F     3        1  0.13129452      FALSE
## 376   F     3        6  0.18714923      FALSE
## 377   F     2        2  0.07342805      FALSE
## 378   F     4        2  0.25434555      FALSE
## 379   F     2        2  0.07342805      FALSE
## 380   F     1        3  0.03989428      FALSE
## 381   M     5        4  0.68596604       TRUE
## 382   M     1        2  0.09060457      FALSE
table(high_use=alc$high_use,prediction=alc$prediction) # confusion matrix: 302 of 382 predictions right
##         prediction
## high_use FALSE TRUE
##    FALSE   253   15
##    TRUE     65   49

From this confusion matrix we can calculate that this model gets 302/382 predictions right: 79% right, 21% wrong.
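Beyond raw accuracy, the confusion matrix also gives sensitivity and specificity; a sketch with the counts copied from the table above:

```r
# Accuracy, sensitivity and specificity from the confusion matrix above
cm <- matrix(c(253, 65, 15, 49), nrow = 2,
             dimnames = list(high_use = c(FALSE, TRUE),
                             prediction = c(FALSE, TRUE)))
sum(diag(cm)) / sum(cm)                   # accuracy: (253 + 49) / 382, ~0.79
cm["TRUE", "TRUE"] / sum(cm["TRUE", ])    # sensitivity: 49 / 114, ~0.43
cm["FALSE", "FALSE"] / sum(cm["FALSE", ]) # specificity: 253 / 268, ~0.94
```

The low sensitivity is the numeric counterpart of the observation that the model predicts low use better than high use.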

g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction)) + geom_point()
g

This plot shows the same thing: the observations plotted by probability, prediction and high_use. From it we can see that the model predicts low alcohol use better than high alcohol use.

table(high_use = alc$high_use, prediction =
alc$prediction)%>%prop.table()%>%addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000

Accuracy of the model

Loss function: the proportion of wrongly classified observations with this model.

loss_func<-function(class,prob){
  n_wrong <-abs(class-prob)>0.5
  mean(n_wrong)
}
#mean proportion of wrongly classified observations = mean prediction error; 20.9% wrong predictions with this model on the training data

loss_func(class = alc$high_use, prob = alc$probability) 
## [1] 0.2094241
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293

20.9% of the observations are classified wrong by this model's predictions. Always guessing FALSE (the majority class) produces 29.8% wrong predictions, so this model is better than simply guessing.
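A compact restatement of the loss function on toy vectors: TRUE/FALSE coerce to 1/0, so |class - prob| > 0.5 flags a misclassification.

```r
# Toy check of the loss function: one of three cases is misclassified
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
loss_func(class = c(TRUE, TRUE, FALSE), prob = c(0.9, 0.2, 0.1)) # 1/3
```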

Cross-Validation

library(boot) 
cv <-cv.glm(data = alc, cost = loss_func, glmfit = model1, K = 10) 
cv$delta[1] 
## [1] 0.2172775

22% is the mean prediction error of the model estimated by 10-fold cross-validation.

Classification and Clustering

Annina Haverinen 19.11.2018

Loading the “Boston” dataset from the MASS package (already available in R). Link to dataset.
This is a data set about housing values in suburbs of Boston.

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81-102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley

Variables in the dataset:

  1. crim= per capita crime rate by town.
  2. zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  3. indus = proportion of non-retail business acres per town.
  4. chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. nox = nitrogen oxides concentration (parts per 10 million).
  6. rm = average number of rooms per dwelling.
  7. age = proportion of owner-occupied units built prior to 1940.
  8. dis = weighted mean of distances to five Boston employment centres.
  9. rad = index of accessibility to radial highways.
  10. tax = full-value property-tax rate per $10,000.
  11. ptratio = pupil-teacher ratio by town.
  12. black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. lstat = lower status of the population (percent).
  14. medv = median value of owner-occupied homes in $1000s.
library(MASS) 
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
dim(Boston) # 506 observations of 14 variables
## [1] 506  14
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Names of the variables in the data; 506 observations of 14 variables.

library(tidyr) # for gather()
gather(Boston) %>% 
  ggplot(aes(value)) + facet_wrap("key", scales= "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Visualisation of the variables: the variable rm is the only one that looks normally distributed; the others seem rather skewed.

Correlation plots

library(tidyr)
library("ggplot2")
library("GGally")
library("corrplot")
## corrplot 0.84 loaded
cor_matrix <- cor(Boston)%>%round(2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
p1 <- ggcorr(cor_matrix,geom="circle")
# correlation plot matrix 
p1

p2<- corrplot(cor_matrix, method = "circle",type="upper",cl.pos="b",tl.pos="d",tl.cex = 0.6)


Looking at the correlation plot we can see that the variables positively correlated with crim (our variable of interest, per capita crime rate by town) are rad (index of accessibility to radial highways), lstat (lower status of the population, percent) and tax (full-value property-tax rate per $10,000). Negatively associated with crim are medv (median value of owner-occupied homes in $1000s) and dis (weighted mean of distances to five Boston employment centres).

Scaling the dataset

scaled_boston <- scale(Boston)
summary(scaled_boston)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
summary(scaled_boston$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
crim_vector <- quantile(scaled_boston$crim) # quantile break points for the new crime variable
crim_vector
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <-cut(scaled_boston$crim, breaks = crim_vector, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
scaled_boston <- dplyr::select(scaled_boston, -crim)
scaled_boston <- data.frame(scaled_boston,crime)
n <- nrow(scaled_boston)
n
## [1] 506
train <- sample(n, size = n * 0.8) # random sample of 80% of the row indices
train_set <- scaled_boston[train,] # training set with 80% of obs.
dim(train_set)
## [1] 404  14
test <- scaled_boston [-train,] # test set with 20% of obs.
dim(test)
## [1] 102  14
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Linear discriminant analysis

  • a classification method that finds the linear combinations of the variables that best separate the classes of the target variable.
  • in this exercise the target variable is the crime rate class.
lda1<-lda(crime~., data=train_set)
lda1
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2425743 0.2549505 0.2574257 0.2450495 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       1.0528366 -0.9749369 -0.11163110 -0.9020874  0.51845273
## med_low  -0.1059251 -0.3396443 -0.04298342 -0.5648400 -0.13206300
## med_high -0.3948894  0.1848484  0.18195173  0.3793551 -0.02154002
## high     -0.4872402  1.0171737 -0.03371693  1.0707206 -0.39658083
##                 age        dis        rad        tax     ptratio
## low      -0.9156278  0.9727477 -0.6912387 -0.7567062 -0.46493272
## med_low  -0.3321160  0.3056852 -0.5436697 -0.4821863 -0.01802759
## med_high  0.4130206 -0.3554782 -0.4761040 -0.3510220 -0.21485482
## high      0.8092204 -0.8485658  1.6375616  1.5136504  0.78011702
##                black       lstat        medv
## low       0.38257007 -0.79565744  0.60421080
## med_low   0.31804177 -0.16601039  0.01817895
## med_high  0.06424352  0.08025286  0.06187999
## high     -0.74252566  0.91278445 -0.66681928
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.06913123  0.62253739 -0.87642543
## indus    0.12303584 -0.30386020  0.17547140
## chas    -0.04515286 -0.04074817  0.06291740
## nox      0.31124788 -0.66912601 -1.48795046
## rm       0.04157514 -0.01563936 -0.25867848
## age      0.13352783 -0.31042971 -0.18910935
## dis     -0.08030070 -0.14383989 -0.11632934
## rad      3.94011359  0.97891073 -0.07836587
## tax      0.10680198  0.06264412  0.69649928
## ptratio  0.12930532 -0.06535272 -0.27320048
## black   -0.05030945  0.05298142  0.13157895
## lstat    0.21240889 -0.20114242  0.14195983
## medv     0.07239768 -0.30023249 -0.24684186
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9615 0.0296 0.0090
classes <- as.numeric(train_set$crime) # 1,2,3,4 instead of low, med_low, med_high, high

LDA biplot

# helper function: overlay the LDA coefficients as arrows on the biplot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)  # matrix of discriminant coefficients
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col = color, pos = 3)
}
plot(lda1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda1, myscale = 1)

Predict with LDA

# predict classes with test data
lda.pred <- predict(lda1, newdata = test)
# cross tabulate the results
table(prediction = lda.pred$class, correct = correct_classes) %>% addmargins()
##            correct
## prediction low med_low med_high high Sum
##   low       15       3        0    0  18
##   med_low   12      15        4    0  31
##   med_high   2       5       14    0  21
##   high       0       0        4   28  32
##   Sum       29      23       22   28 102

The prediction was correct for 72 of the 102 test observations (about 71%).
high crime was predicted correctly in 28/28 cases,
med_high crime in 14/22 cases,
med_low crime in 15/23 and low crime in 15/29 cases, so the model separates the high-crime areas best while the two lowest categories are harder to tell apart.
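The accuracy figures above are just the matching entries of the two class vectors; a minimal self-contained sketch of the same computation, with made-up vectors standing in for lda.pred$class and correct_classes:

```r
# invented example vectors (same names' roles as in the code above)
pred  <- c("low", "low", "high", "med_low", "high")
truth <- c("low", "med_low", "high", "med_low", "high")
mean(pred == truth)  # overall accuracy: share of matching entries, here 0.8
```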

library(MASS)
data("Boston")
# standardize the variables so that k-means distances are comparable across scales
scaled_boston <- scale(Boston)

K-means clustering

# determining the number of clusters: total within-cluster sum of squares for k = 1..15
set.seed(123) # k-means picks random starting centers, so fix the seed for reproducibility
wss <- (nrow(scaled_boston) - 1) * sum(apply(scaled_boston, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(scaled_boston, 
    centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
  ylab = "Within groups sum of squares")

km<- kmeans(scaled_boston,3)
summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       42    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
km1 <- kmeans(scaled_boston, 5)
summary(km1)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       70    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

Based on the "elbow" of the plot, where the within-group sum of squares stops dropping sharply, 3 clusters would be appropriate.
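The elbow can also be inspected numerically by looking at the successive drops in the total within-cluster sum of squares; a small sketch re-running k-means for k = 1..6 (the seed and nstart values are my additions for stability, not part of the exercise):

```r
library(MASS)  # for the Boston data
set.seed(1)    # k-means starts are random
sb <- scale(Boston)
# total within-cluster sum of squares for each candidate k
wss <- sapply(1:6, function(k) kmeans(sb, centers = k, nstart = 10)$tot.withinss)
diff(wss)  # the largest drops occur before the elbow
```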

pairs(scaled_boston, col=km$cluster)

BONUS

data("Boston")
scaled_boston <- scale(Boston)
dim(scaled_boston)
## [1] 506  14
km_bonus <- kmeans(scaled_boston, 4)
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
lda2 <- lda(km_bonus$cluster~., data = scaled_boston)
lda2
## Call:
## lda(km_bonus$cluster ~ ., data = scaled_boston)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.3616601 0.2608696 0.2173913 0.1600791 
## 
## Group means:
##         crim         zn      indus        chas        nox         rm
## 1 -0.3909812 -0.2256428 -0.6123014 -0.07870119 -0.5097893  0.3443714
## 2  1.0632703 -0.4872402  1.0149946 -0.03371693  1.0159127 -0.3735788
## 3 -0.3215538 -0.4823678  0.6058983  0.19296460  0.4710723 -0.5944241
## 4 -0.4127310  1.9588740 -1.0935425 -0.02929819 -1.1435430  0.6380135
##          age        dis        rad        tax     ptratio       black
## 1 -0.3398029  0.2457676 -0.5676700 -0.7140001 -0.29244580  0.36383532
## 2  0.7542188 -0.8233749  1.6596029  1.5294129  0.80577843 -0.75124560
## 3  0.6869197 -0.5438170 -0.5851280 -0.2061098  0.06253067  0.03661229
## 4 -1.3942484  1.5250604 -0.6274061 -0.5993631 -0.73732772  0.35253338
##        lstat       medv
## 1 -0.5489002  0.4511972
## 2  0.8328654 -0.6664074
## 3  0.5963120 -0.4822750
## 4 -0.9269606  0.7215673
## 
## Coefficients of linear discriminants:
##                   LD1          LD2         LD3
## crim     0.0802880132 -0.188233567 -0.12940945
## zn       0.2778536432 -1.307489935  1.28053403
## indus    0.2885719633  0.232092639  0.66603797
## chas    -0.0233666154 -0.003197753  0.22741150
## nox      0.0001974368  0.093699623  0.35927729
## rm      -0.0538094053 -0.188523052 -0.15839928
## age      0.0383469023  0.635299039  0.17994768
## dis     -0.2692110931 -0.530377962 -0.01103552
## rad      5.9895792739 -1.226336654 -1.48557442
## tax      0.2063385825 -0.171020064  0.83656786
## ptratio  0.2700183626  0.161060370  0.31935417
## black   -0.0385016887 -0.022900357 -0.08539887
## lstat    0.0616549825  0.182880501  0.40175464
## medv    -0.0752439777 -0.077267369 -0.01891136
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8947 0.0879 0.0174
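The proportion of trace printed above can also be computed directly from the fitted lda object's singular values; a minimal sketch using the built-in iris data rather than the Boston clusters:

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
# proportion of trace = squared singular values, normalised to sum to 1
round(fit$svd^2 / sum(fit$svd^2), 4)  # 0.9912 0.0088 for iris
```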
classes1 <- as.numeric(km_bonus$cluster)
# reuse the lda.arrows() helper defined in the LDA biplot section above
plot(lda2, dimen = 2, col = classes1, pch = classes1)
lda.arrows(lda2, myscale = 1)

SUPER BONUS

model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda1$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda1$scaling
matrix_product <- as.data.frame(matrix_product)
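Note that predict() on an lda fit centers the data before projecting, so this manual matrix product differs from the scores in predict(lda1)$x only by a constant shift in each column; a sketch illustrating this with the built-in iris data:

```r
library(MASS)
fit    <- lda(Species ~ ., data = iris)
manual <- as.matrix(iris[, 1:4]) %*% fit$scaling  # uncentered projection
auto   <- predict(fit)$x                          # predict() centers the data first
# each column differs only by a constant offset, so its spread is (numerically) zero
apply(manual - auto, 2, sd)
```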
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_set$crime)